Enforce DataFrame display memory limits with max_rows + min_rows constraint (deprecate repr_rows) #1367

kosiew wants to merge 15 commits into apache:main

Conversation
Title changed from "… max_rows and enforcing min_rows_display <= max_rows" to "… max_rows + min_rows constraint (deprecate repr_rows)"
timsaucer left a comment:
Thank you for taking this on. I think the new solution is very nice!
min_rows_display=20,  # Minimum number of rows to display
repr_rows=10,  # Number of rows to display in __repr__
It looks like the default here has min_rows > max_rows. Also should we have consistent naming of the two? Either min_rows and max_rows or min_rows_display and max_rows_display?
I think the _display was differentiating what happens during a display() call vs __repr__ but I think these values get used during both calls.
I'll change to use min_rows, max_rows.
min_rows_display=50,  # Always show at least 50 rows
repr_rows=20  # Show 20 rows in __repr__ output
Same as above: the difference between the _display suffix and no suffix, and again we have min_rows > max_rows here.
    Minimum number of rows to display.
repr_rows : int, default 10
    Default number of rows to display in repr output.
min_rows_display : int, default 10
It's not about this PR per se, but maybe this is an opportunity to tighten up the comments here. We're repeating ourselves with the types and defaults. Those are already in the type hints. I think it's becoming customary to not duplicate that information and the argument line is the preferred place to keep it. That way we don't have to worry about maintaining the values in two places.
I'll refactor the docstrings to follow NumPy/pandas style without duplicating types and defaults.
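For example, the Parameters section could drop the repeated types/defaults since they already live in the signature (a sketch with a simplified signature and illustrative defaults, not the exact code in this PR):

```python
def configure_formatter(*, max_rows: int = 20, min_rows_display: int = 10) -> None:
    """Configure the global DataFrame HTML formatter.

    Parameters
    ----------
    max_rows
        Maximum number of rows rendered in ``__repr__``/HTML output.
    min_rows_display
        Number of rows always rendered even when the memory budget is
        exceeded; must not exceed ``max_rows``.
    """
```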
| """ | ||
| self._max_rows = value | ||
|
|
||
| @property |
If repr_rows is being deprecated, why add an accessor?
I added the accessors for backward compatibility during the deprecation period.

Rationale:
- User code may access the property directly: code like `formatter.repr_rows = 20` continues working during the deprecation period.
- Graceful migration path: users get a warning, but their code doesn't break.
- Custom formatter implementations: external code that inherits from the formatter and accesses `repr_rows` directly will continue to work.

Shall we keep the accessors for now with the deprecation warnings, and plan removal in the next major version?
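For illustration, a deprecated alias property could look roughly like this (a sketch, not the exact code in this PR; the constructor is simplified):

```python
import warnings


class DataFrameHtmlFormatter:
    """Sketch: max_rows is the real storage, repr_rows is a warning alias."""

    def __init__(self, max_rows: int = 20) -> None:
        self._max_rows = max_rows

    @property
    def max_rows(self) -> int:
        return self._max_rows

    @max_rows.setter
    def max_rows(self, value: int) -> None:
        self._max_rows = value

    @property
    def repr_rows(self) -> int:
        warnings.warn(
            "repr_rows is deprecated; use max_rows instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self._max_rows

    @repr_rows.setter
    def repr_rows(self, value: int) -> None:
        warnings.warn(
            "repr_rows is deprecated; use max_rows instead",
            DeprecationWarning,
            stacklevel=2,
        )
        self._max_rows = value
```

This keeps `formatter.repr_rows = 20` working while steering users toward `max_rows`.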
    @repr_rows.setter
    def repr_rows(self, value: int) -> None:
        """Set the maximum number of rows using deprecated name.
python/tests/test_dataframe.py
def test_html_formatter_repr_rows(df, clean_formatter_state):
    configure_formatter(min_rows_display=2, repr_rows=2)

def test_html_formatter_memory_boundary_conditions(df, clean_formatter_state):
Maybe switch to large_df instead of df here?
Great suggestion! The large_df fixture (100,000 rows) is much better suited for testing memory boundary conditions than the standard `df` fixture.
python/tests/test_dataframe.py
# Get the raw size of the data to test boundary conditions
# First, capture output with no limits
configure_formatter(max_memory_bytes=10 * MB, min_rows_display=1, max_rows=100)
If you do switch to `large_df`, then I think it may go above the 100-row limit you have here.
I'll adjust this higher.
python/tests/test_dataframe.py
unrestricted_rows = count_table_rows(unrestricted_output)

# Test 1: Very small memory limit should still respect min_rows
configure_formatter(max_memory_bytes=10, min_rows_display=1)
I think a better test is one where we do hit the memory limit well before we hit the min number of rows, hence the recommendation to switch to large_df. Actually, maybe we want a different dataframe, one where we know we have multiple batches instead of a single batch. The thing this isn't doing is verifying we've ended the stream early, but I think that would have to be a Rust test instead of a pytest.
I'll create tests for early stream termination behavior with multi-batch DataFrames.
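A sketch of what such a test might look like (the `clean_formatter_state` fixture is the one in the existing test file; sizes, thresholds, and the `<tr>`-counting shortcut are only illustrative and would need tuning):

```python
import pyarrow as pa
from datafusion import SessionContext
from datafusion.html_formatter import configure_formatter


def test_memory_limit_with_multiple_batches(clean_formatter_state):
    # Build a DataFrame backed by several record batches so the display code
    # has the opportunity to stop pulling batches once the budget is hit.
    ctx = SessionContext()
    batches = [
        pa.RecordBatch.from_arrays(
            [pa.array(["x" * 1_000] * 1_000)], names=["payload"]
        )
        for _ in range(10)
    ]
    df = ctx.create_dataframe([batches])  # one partition, ten batches

    # Tiny memory budget, small guaranteed minimum, generous row cap: the
    # memory limit should cut the display off long before 10,000 rows.
    configure_formatter(max_memory_bytes=64 * 1024, min_rows_display=2, max_rows=1_000)
    html = df._repr_html_()

    # Crude proxy for rendered row count; as noted above, verifying that the
    # stream itself ended early would need a Rust-side test.
    assert html.count("<tr>") < 10_000
```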
src/dataframe.rs
let max_bytes = get_attr(formatter, "max_memory_bytes", default_config.max_bytes);
let min_rows = get_attr(formatter, "min_rows_display", default_config.min_rows);
let repr_rows = get_attr(formatter, "repr_rows", default_config.repr_rows);
let max_rows = get_attr(formatter, "max_rows", default_config.max_rows);
Since users may have provided their own implementation of the formatter, I think we need to first try getting `max_rows`. If that fails, try getting `repr_rows`. If that fails, take the default. When we remove `repr_rows` entirely after it's been deprecated for a few releases, we can revert to this simpler logic.
I'll add the fallback for repr_rows.
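In Python terms, the lookup order being suggested is roughly the following (a sketch of the intent only; the actual change lives in the Rust `get_attr` calls shown above):

```python
def resolve_max_rows(formatter, default_max_rows: int) -> int:
    # Prefer the new attribute, fall back to the deprecated alias,
    # and only then use the built-in default.
    max_rows = getattr(formatter, "max_rows", None)
    if max_rows is not None:
        return max_rows
    repr_rows = getattr(formatter, "repr_rows", None)  # deprecated name
    if repr_rows is not None:
        return repr_rows
    return default_max_rows
```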
Removed type annotations and redundant default values from parameter names. Enhanced descriptions for clarity and added context for usage. Fixed formatting for the documentation sections to improve readability.
Which issue does this PR close?
Rationale for this change
Large DataFrames could ignore the configured `max_memory_bytes` limit during display. Previously the defaults (`repr_rows=10`, `min_rows_display=20`) meant the collection loop condition `rows_so_far < min_rows` stayed true even after exceeding the memory budget, causing significantly more data to be streamed/collected than intended.

This PR resolves that by:
- Introducing an explicit `max_rows` setting (replacing `repr_rows`).
- Validating that the minimum rows to display (`min_rows_display`) cannot exceed the maximum rows cap.
- Keeping a deprecated `repr_rows` alias so existing users aren't broken immediately.

What changes are included in this PR?
- Docs: Update user guide examples to use `max_rows` instead of `repr_rows`.
- Python formatter API:
  - Add `max_rows` as the primary configuration for limiting displayed rows.
  - Keep `repr_rows` as a deprecated alias (constructor arg + property), emitting `DeprecationWarning`.
  - Add centralized validation via `_validate_formatter_parameters()`:
    - Ensure `min_rows_display <= max_rows`.
    - Error when both `repr_rows` and `max_rows` are provided with different values.
  - Store the resolved value internally as `_max_rows` and expose `max_rows` / deprecated `repr_rows` properties.
  - Add `max_rows` to the `configure_formatter()` allowed keys.
- Rust display/streaming logic:
  - Rename `repr_rows` -> `max_rows`.
  - Lower the default `min_rows` (20 → 10) to avoid violating the min/max relationship.
  - Validate `min_rows <= max_rows`.
  - The collection loop runs while within the limits (`memory && max_rows`) or until the guaranteed `min_rows` is reached, with clearer comments.

Are these changes tested?
Yes.
- Updated existing formatter tests to use `max_rows`.
- Added new tests for:
  - Memory-limit boundary conditions (tiny budget, default budget, large budget, and min-rows override).
  - `repr_rows` backward compatibility: a `DeprecationWarning` is emitted when it is used, its value is applied to `max_rows`, and conflicts with an explicitly supplied `max_rows` are rejected.
  - Validation failures for invalid `max_rows` and for `min_rows_display > max_rows`.

Are there any user-facing changes?
Yes.
- New option: `max_rows` is now the preferred way to cap rows displayed in repr/HTML output.
- Deprecation: `repr_rows` is deprecated and will emit a `DeprecationWarning`; existing code that sets `repr_rows` continues to work. Supplying both `repr_rows` and `max_rows` with different values raises a `ValueError` (see the usage sketch below).
- Behavioral change: the default minimum rows displayed changes from 20 to 10.
- Docs: updated examples and clarified that `min_rows_display` must be `<= max_rows`.

If the deprecation/rename is considered a public API change, please add the `api change` label.
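A rough sketch of the intended user-facing behavior (illustrative only; exact messages and defaults may differ):

```python
import warnings

from datafusion.html_formatter import configure_formatter

# Preferred: cap displayed rows with the new option.
configure_formatter(max_rows=50, min_rows_display=10)

# The deprecated alias still works, but warns.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    configure_formatter(repr_rows=30)
assert any(issubclass(w.category, DeprecationWarning) for w in caught)

# Conflicting values are rejected.
try:
    configure_formatter(repr_rows=10, max_rows=20)
except ValueError:
    pass

# So is a minimum that exceeds the maximum.
try:
    configure_formatter(min_rows_display=50, max_rows=20)
except ValueError:
    pass
```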
LLM-generated code disclosure

This PR includes code and comments generated with assistance from an LLM. All LLM-generated content has been manually reviewed and tested.